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Abstract 

Graphical models are widely used to make inferences concerning interplay in multivariate systems, as 
modeled by a conditional independence graph or network. In many applications, data are collected from 
multiple individuals whose networks may differ but are likely to share many features. Here we present 
a hierarchical Bayesian formulation for joint estimation of such networks. The formulation is general 
and can be applied to a number of specific graphical models. Motivated by applications in biology, we 
focus on time-course data with interventions and introduce a computationally efficient, deterministic 
algorithm for exact inference in this setting. Application of the proposed method to simulated data 
demonstrates that joint estimation can improve ability to infer individual networks as well as differences 
between them. Finally, we describe approximations which are still more computationally efficient than 
the exact algorithm and demonstrate good empirical performance. 

1 Introduction 

Graphical models are widely used to model multivariate systems. Estimation of conditional independence 
structure (often called "network inference" or "structure learning") is increasingly a mainstream approach, 
for example in computational biology. Given data X a network estimator gives an estimate G(X) of the 
conditional independence graph G. The type of graph and its scientific interpretation depend on the model 
and scientific context. 

In many applications, data is collected on multiple individuals j G J that may differ with respect to 
interplay between variables, such that corresponding conditional independence graphs G 3 may be individual- 
specific. For example, in biology, individuals may correspond to different patients or cell lines and the 
networks themselves to gene regulatory or protein signaling networks. Interplay in such networks can depend 
on the genetic and epigenetic state of the individuals, such that even for a well-defined system, such as 
signaling downstream of a certain receptor class, or a sub-part of the transcriptional program, details may 
differ between even closely related samples (Ideker and Krogan, 2012). For example, in yeast signaling, 
edges in the well-understood mitogen-activated protein kinase (MAPK) pathway can change depending on 
context (Zalatan et a/., 2012), whilst in cancer, it is thought that individual cell lines may differ with respect 
to signaling network connections. Continuing reduction in the unit cost of biochemical assays has led to 
an increase in experimental designs that include panels of potentially heterogeneous individuals (Barrctina 
et ai, 2012; Cao et al, 2011; Maher, 2012; The Cancer Genome Atlas Network, 2012). In such settings, 
given individual specific data y 3 , there is scientific interest in the individual specific networks G 3 and their 
similarities and differences. 

The case of multiple related individuals poses a number of statistical challenges for network inference: 

• Efficiency. If the networks share features, then individual-level estimation (i.e. G 3 — G(y 3 )), may 
be inefficient, since there is no sharing of information at the population level. Although individual 
network estimators G 3 may be well-behaved as the individual-specific sample size rij grows large (e.g. 
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Kalisch and Biihlmann (2007)), in practice small-to-moderate rij's and the inherently high-dimensional 
nature of network inference render inference challenging. 

• Data aggregation. Aggregating data from multiple individuals and then performing network infer- 
ence offers a way to obtain larger sample sizes. However, in settings where data from individuals are 
inhomogeneous (in the sense that the graphs may differ between individuals), inferences regarding 
conditional independence structure cannot in general be obtained from aggregated data (Simpson's 
paradox) and testing whether data aggregation is appropriate may be challenging (Pearl, 1998). Esti- 
mating sufficiently homogeneous groups using mixture models and related clustering approaches offers 
an alternative (Zhou et ai, 2009; Mukherjee and Hill, 2011; Rodriguez et ai, 2011; Vu et al, 2012; Hill 
and Mukherjee, 2013), but is challenging in the network setting, as we discuss further below. 

• Ancillary information. Ancillary information may be available both at the "global" (population) 
and "local" (individual) levels. For example, in gene regulation, the biological literature provides 
general information concerning gene-gene interplay, whilst patient-specific characteristics might also 
be available. When such ancillary information is available it may be desirable to include it in inference 
(the "conditionality principle"), but doing so requires care in prior elicitation (Baumbach et ai, 2009; 
Wei and Pan, 2012). 

In this paper we present a Bayesian approach to joint estimation of networks. The high-level formulation 
we propose is general and could be applied to a wide range of graphical model formulations. We present 
a detailed development for the time-course setting, focusing on directed graphical models called Dynamic 
Bayesian Networks (DBNs). These are directed acyclic graphs (DAGs) with explicit time-indices (Murphy, 
2002). The main features of our approach are: 

• Bayesian framework. We use a hierarchical Bayesian model, summarized in Fig. 2. Regularization 
is achieved using priors over both parameters and networks. We focus in particular on regularization 
of individual networks, introducing a latent network G to couple inference across the population. We 
report posterior marginal inclusion probabilities for every possible edge, thus providing a confidence 
measure for the inferred network topologies and offering robustness in settings where posterior mass is 
not highly concentrated on a single model. 

• Computationally efficient estimation from time-course data. For the time-course setting, we 
put forward an efficient and deterministic algorithm. This is done by exploiting modularity of the DBN 
likelihood (Hill et a/., 2012) coupled with a sparsity restriction and a sum-product-type algorithm. In 
moderate-dimensional settings this allows exact joint estimation to be carried out in seconds to minutes 
(we discuss computational complexity below) making our approach suitable for interactive use. 

• Incorporation of ancillary information. We allow for the inclusion of individual-specific ancillary 
information. Following Spencer and Mukherjee (2012) we also allow for interventional data, in which 
time courses are obtained under external intervention on network nodes. 

Joint estimation of graphical models has recently been discussed in the penalized likelihood literature, 
with contributions including Danaher et al. (2012); Guo et al. (2011); Yang et al. (2012). In these studies, 
L\ penalties, such as the fused graphical LASSO, are used to couple together inference of Gaussian graphical 
models (GGMs). Our work complements these efforts by offering a Bayesian formulation of joint estima- 
tion. This facilitates regularization using prior and ancillary network information. Moreover, our approach 
provides a natural way to estimate confidence in the inferred structure, providing robustness in multi-modal 
problems (Claassen and Hcskcs, 2012). Further, we focus on the time-course setting and DBNs rather than 
static data and GGMs. However, we note that unlike the above penalized approaches the Bayesian approach 
we propose is not well-suited to extremely high-dimensional settings with thousands of variables. 

A recent paper by Penfold et al. (2012) considers Bayesian joint estimation for time-course data. Our 
work is in the same vein but differs in two main respects. First, we allow for prior information regarding the 
network structure and ancillary information including individual-specific characteristics. Network priors and 
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ancillary information can usefully constrain inference, not least in biological settings. For example in the 
cancer signaling example we consider below, much is known concerning relevant biochemistry (Fig. 1) and 
individual-specific information pertaining to e.g. mutation status and receptor expression is often available 
(nowadays also in the clinical setting). Second, for the time-course setting, the exact algorithm we propose 
offers massive computational gains in comparison to the approach proposed by Penfold et al. (2012). As we 
discuss in detail below the methodology of Penfold et al. (2012) is prohibitively computationally expensive for 
the applications we consider here. Third, the computational efficiency of our approach allows us to present 
a much more extensive study of joint estimation, using both simulated and real data, than has hitherto been 
possible. This adds to our understanding of the performance of hierarchical Bayesian formulations for joint 
estimation. 

Mixtures of graphical models have been used to explore heterogeneous populations (Zhou et al., 2009; 
Mukherjee and Hill, 2011; Rodriguez et al, 2011; Vu et al, 2012; Hill and Mukherjee, 2013). However, mix- 
ture modeling requires the strong assumption that there exist groups which are (sufficiently) homogeneous 
with respect to model parameters. Otherwise, mixture components are forced to model heterogeneous pop- 
ulations, resulting in potentially poor fit and networks that may not be scientifically meaningful. Moreover, 
while graphical model estimation remains non-trivial, mixtures of graphical models are still more challenging, 
due to a number of factors relating to the hidden nature (and number) of the mixture components. 

Further related work includes Wcrhli and Husmcicr (2008), who propose a Bayesian approach to network 
inference based on multiple, steady-state datasets where in each dataset only a subset of the (shared) 
underlying network is identifiable. Dondclinger et al. (2012) extend the information sharing scheme from 
Werhli and Husmeicr (2008) in the context of inference for time-varying networks. Hoff (2009) considers 
covariance estimation from a heterogeneous population, treating individual covariance matrices as samples 
from a matrix-valued probability distribution. Network priors have been discussed in the literature, including 
Imoto et al. (2003); Mukherjee and Speed (2008); Wei and Pan (2012). Our work differs from these efforts 
by focusing on joint estimation; as we describe below, this leads to a different model structure and prior 
specification. 

The remainder of the paper is organized as follows. In Section 2 we lay out a hierarchical Bayesian 
formulation and in Section 3 we discuss computationally efficient exact inference. Empirical results are 
presented in Section 4 using simulated (Section 4) datasets. Finally we close with a discussion of our 
findings in Section 5. 

2 Joint network inference 

We carry out joint network inference using the hierarchical model shown in Fig. 2 that includes a prior 
network (G°) as well as a latent network (G); each individual network (G 3 ; we use superscript notation when 
referring to a particular individual) is conceptually viewed as a variation upon the latter. Individual data 
y 3 are then conditional upon individual networks. Estimates of the individual networks G 3 are regularized 
by shrinkage towards the common latent network G which in turn may be constrained by an informative 
network prior. Since the latent network is itself estimated, this allows for adaptive regularization. 

2.1 Hierarchical model 

Consider the space Q of (directed) networks (not necessarily acyclic) on the vertex set V = {1, ... ,P}- A 
network G £ Q decomposes over parent sets as G = G\ x • • • x Gp where G p C V are the network parents 
of p £ V . Write Q p for the set of possible parent sets for variable p, such that formally Q = Gi X • • • X Gp- 
Write J = {1, . . . , J} for the set of individuals in the population. 

As shown in Fig. 2, each individual network G 3 is conditional on a latent network G which in turn 
depends on a prior network G° (Section 2.2). As in any graphical model, data y 3 is conditional on graph 
G 3 and parameters 6 3 ; A 3 denotes any ancillary information available on individual j. In this Section we 
describe our general model and network priors, while in Section 3 we discuss the special case of inference for 
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Figure 1: Epidermal growth factor receptor (EGFR) pathway for mammalian cells, characterized by extensive 
biochemistry. [Here edges represent high-level summaries of often complex molecular interactions that may 
involve latent chemical species.] 

time-course data, giving full details of the likelihood for that case. The model is specified by 

p(G|G°,77) ex exp(-r 7< i(G,G )) (1) 

piG 1 , ...,G J \G,X,A 1 ,..., A J ) oc exp I X j d j (G j ,G;A j )\ (2) 

V M I 

where the functionals d? , d : Q x Q — > K and hyperparameters rj, A 1 , . . . , A J must be specified (Section 2.2). 
This formulation is borrowed from statistical mechanics, where d J ,d may be interpreted as energy terms, 
r/, A , . . . , \' J as inverse temperature parameters and Eqns. 1,2 as Boltzmann (or Gibbs) distributions. Taken 
together with a suitable graphical model likelihood p(y 3 |G : ' , we obtain the data-generating model. JNI 
performs inference jointly over (G, G 1 , . . . , G J ), with information sharing occurring via the latent network G. 
The use of a latent network follows Guo et al. (2011); Imoto et al, (2006); Penfold et al. (2012); Werhli and 
Husmcicr (2008). In some biological settings, it may be natural to think of the latent network as describing 
a "wild type network" , however such an interpretation is not required. We refer to this general formulation 
as joint network inference (JNI). 

2.2 Network prior 

Specifying a network prior (Eqn. 1) requires a penalty functional d : Q x Q — > K and a prior network 
G° G Q, with the former capturing how close a candidate network G € Q is to the latter (Imoto et al., 2003; 
Mukherjce and Speed, 2008). We discuss choice of G° below. Given G°, a simple choice of penalty function 
d is the structural Hamming distance d(G,G°) = SHD(G, G°) := J2 P ev |G p AG°|. Here AAB denotes the 
symmetric difference of sets A and B and |A| denotes cardinality of the set A. The hyperparameter 77 
controls the strength of the prior network G° (Eqn. 1). For brevity we follow Penfold et al. (2012) by 
restricting attention to SHD priors, however our formulation is general (see below) and compatible with 
other penalty functionals. For their work on joint estimation of inverse covariance matrices, Danaher et al. 
(2012); Yang et al. (2012) employed the fused graphical LASSO (FGL) penalty, which may be interpreted 
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Figure 2: Joint network inference (JNI): A hierarchical model for analysis of multivariate data from a 
heterogeneous population. [Shaded nodes are unobserved. G° = prior network, G = latent network, G 3 = 
individual j's network, 3 = parameters for individual j, y 3 — data obtained on individual j, A 3 = ancillary 
information available on individual j, 77, A 3 — inverse temperature hyperparameters, <j) 3 — parameters defining 
a prior on O 3 .] 

as a real-valued extension of SHD (strictly speaking, there is no analogue of the latent network G here; FGL 
directly penalizes the difference between individual networks G 3 \G k ). Another interesting extension due to 
Werhli and Husmeier (2008) distinguishes G°\G ("false prior positives") and G\G° ("false prior negatives") 
by allocating a separate inverse temperature hyperparameter for each case. Alternatively, one could employ 
a binomial prior as described in Dondelinger et al. (2012), which provides the same distinction, but allows 
for the hyperparameters of the binomial to be integrated out. 

Conditional on a latent network G, individual networks G 3 are regularized in a similar way, as d 3 (G 3 , G) = 
SHD (G 3 ,G). In their work on combining multiple data sources, Werhli and Husmeier (2008) allow the A 3 
to vary over individuals (data sources) j £ J ', with A 3 reflecting the quality of dataset j. Likewise Penfold 
et al. (2012) learn the A 3 on an individual by individual basis. However, in both studies, hyperparameter 
elicitation is non-trivial (see Section 2.4). To further limit scope, we consider only the special case where 
A 1 = A 2 = • • • = A J := A. 

When ancillary information A 3 is available regarding a specific individual network G 3 , it is desirable to 
augment the prior specification in such a way as to condition upon A 3 . In general such modification will be 
application specific. 

Although we focus on SHD priors, the inference procedures presented in this paper apply to the more 
general class of modular priors, which may be written in the form 

d(G, G°) = d p(G P , G° p ), d 3 (G 3 ,G; A 3 ) = £ d p (G p , G p ; A 3 ) (3) 

p&V p€V 

for some functionals d p , d 3 p : Q p x Q p — > M. Modularity here refers to a factorization over variables p € V 1 
implying that only local information is available a priori. The SHD priors are clearly modular. 
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2.3 Two special cases: INI and ANI 



Up to inclusion of ancillary information, prior strength is fully determined, in this simplified setting, by 
the parameter pair (A, 77). Taking rj — > 00 requires that the latent network G is (almost surely) identical 
to the prior network G°; in the limit this corresponds to treating network inference for each individual 
separately i.e. the estimator G 3 = G(y°). We call this approach "independent network inference" (INI). 
Conversely, taking A — > 00 requires that (almost surely) individual networks G 3 do not deviate from the 
latent network G; this corresponds to assuming individuals have identical (unknown) network structure, but 
allowing parameter values 3 to vary between individials, possibly becoming equal to zero. We call this 
approach "aggregated network inference" (ANI). Taking A, 77 — > 00 together corresponds to using only the 
prior. A further, cruder, approach would be to simply combine all data in order to estimate a single network 
and parameter set, an approach which Werhli and Husmeier (2008) call "monolithic". We compare these 
approaches empirically in Section 4. 

2.4 Network prior elicitation 

Elicitation of hyperparameters for network priors is an important and non-trivial issue. Hyperparameters 
can be set using the data, but this poses a number of challenges, as reported in Dondclingcr et al. (2012); 
Pcnfold et al. (2012); Werhli and Husmeier (2008). In the context of sequential hierarchical network priors, 
Dondclingcr et al. (2012) observed that when there is limited data available, hyperparameters inferred from 
the data may be biased towards imposing too much agreement with the prior. Penfold et al. (2012) used 
an improper hyperprior over the individual inverse temperature parameters X 3 , reporting that for most 
individuals posterior marginals did not differ greatly from the prior (possibly due to uninformative data). 
Similarly Werhli and Husmeier (2008) assigned improper flat prior distributions over the hyperparameters, 
reporting that estimation was rather difficult. Due to such weak identifiability of hyperparameters, we chose 
instead to specify the hyperparameters A, 77 in a subjective manner. 

For subjective elicitation of network hyperparameters, interpretable criteria are important. We present 
three criteria below which, for the special case of SHD which we consider, are simple to implement and can be 
used for expert elicitation. These heuristics seek to relate the hyperparameters to more directly interpretable 
measures of the similarity and difference which they induce between prior, latent and individual networks. 

Firstly, we note the following formula for the probability of maintaining edge status (present/absent) 
between the latent network G and an individual network G 3 : 

h x := p(i i G 3 p AG p ) = e _ Ax e ^ X e °_ Axl = Y ^— x . (4) 

This probability provides an interpretable way to consider the influence of A. For example a prior confidence 
of h\ ~ 0.73 that a given edge status in G is preserved in a particular individual G 3 translates into a 
hyperparameter A ~ 1 (see SFig. 1). An analogous equation relates r\ and h v := p(i ^ G p AGp), allowing 
prior strength to be set in terms of the probability that an edge status in the prior network G° is maintained 
in the latent network G. 

A second, related approach is to consider the expected total SHD between an individual network G 3 and 
the wild type network G: 

E(SHD(G- J , G)) = P 2 (l - h x ) (5) 

This can be interpreted as the average number of edge changes needed to obtain G 3 from G. An analogous 
equation holds for r] and h v . 

Thirdly, in certain applications, the latent network G may not have a direct scientific interpretation, 
in which case the criteria presented above may be unintuitive. Then, hyperparameters could be elicited 
by consideration of (a) similarity between individual networks G 3 , G k , and (b) concordance of individual 
networks G 3 with the prior network G . Specifically, we suggest the following two-step procedure: (a) 
exploit the fact that (for an uniform prior on G) we have si := p(i ^ G 3 p AGp) = 1 — 2h\ + 2h\, which 
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facilitates selection of h\ via the formula h\ = (1 + \/2s\ — l)/2. (b) elicit h v using the observation that 
s 2 := p(i ^ G p AG p ) — 1 — h\ — h v + 2h\h v , so that h v — (s 2 + h v — l)/(2h\ — 1). This two-step procedure 
uniquely determines a pair (h v ,h\) 6 [0.5, l) 2 and hence unique hyperparameters (rj, A) € [0, oo) 2 . One 
drawback of this approach is that A is selected under an assumption of a uniform prior on G; that is, 77 = 0. 
The quality of this procedure will therefore depend on the actual informativeness ri of the prior network G° 
on G selected in step (b). This approach to hyperparameter selection has an analogous interpretation using 
expected total SHD. 

The above heuristics may be useful in setting hyperparameters in practice. However, these heuristics are 
certainly no panacea and should be accompanied by checks of sensitivity to hyperparameters, as we report 
below. 

3 Joint network inference for time-course data 

The JNI model and network priors, as described above, are general. To apply the JNI framework in a 
particular context requires an appropriate likelihood at the individual level, that is, to specify the distribution 
p(y^|G J ',0 J ') of data y J conditional on graph G- 7 and parameters OK In this Section we focus on time-course 
data, using DBNs to provide the likelihood. 

3.1 Dynamic Bayesian network formulation 

A DBN is a graphical model based on a DAG whose vertices have explicit time indices; see Murphy (2002) 
for details. Here, following Hill et al. (2012) and others, we use stationary DBNs and permit only edges 
forwards in time. Background and assumptions for DBNs are described in Appendix A. Further assuming 
a modular network prior, structural inference for DBNs can be carried out efficiently, as described in detail 
in Hill et al. (2012). A novel contribution of this paper is to extend these results to allow for efficient and 
exact joint estimation. In order to simplify notation, we define a data-dependent functional 

m 

¥(X)=p(X(l))l[p(X(i)\y(i-l)) (6) 

i=2 

which implicitly conditions upon observed history. Let y° p {t) denote the observed value of variable p in 
individual j at time t. The above notation allows us to conveniently summarize the product 

p(y p (l)\G p )p(y p (2)\y(l),G p ) . . .p(y p (m)\y(m - 1), G j p ). (7) 

as ^(y^\G J p ). Thus, we have that, for DBNs, the full likelihood also satisfies modularity: 

p{y\G\...,G J ) = ]Jl[W p \G p ) (8) 

j£j p<EV 

In other words, the parent sets G J p (p € V, j G J) are mutually orthogonal in the Fisher sense, so that 
inference for each may be performed separately. 

For this paper, the local Bayesian score ty(y p \G p ) corresponds to the marginal likelihood for a linear 
autoregressive formulation described in Appendix B. We consider an extension to facilitate the analysis of 
datasets which contain interventions; this is described in Appendix C. For this choice of model it is possible 
to construct a fully conjugate set of priors, delivering a closed form expression for the local score, contained 
in Appendix D. 

3.2 Computationally efficient joint estimation 

Previous studies have used MCMC to generate samples from the posterior distribution over networks (Penfold 
et at, 2012; Werhli and Husmeier, 2008). However, ensuring mixing has proven to be extremely challenging 
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for joint estimation, with both studies reporting extremely slow convergence. Advances in MCMC and 
parallel computing may in the future ameliorate these issues (Lee et at, 2010), but at present it remains 
the case that fast, interactive joint estimation is currently challenging or prohibitively demanding using 
MCMC. We therefore propose an exact approach, using an in-degree restriction coupled with prior modularity 
and a sum-product-type algorithm, to facilitate efficient estimation. For example, the DREAM4 problem 
(P = 10 variables, J = 5 individuals) considered by Pcnfold et al. (2012) was reported to require "several 
hours per node" for MCMC convergence; our approach solves the entire problem in w 3 seconds. Our 
approach therefore complements MCMC-based inference, allowing fast, interactive investigation in moderate- 
dimensional settings. 

Specifically, we use exact model averaging to marginalize over graphs and report posterior marginal 
inclusion probabilities. We begin by computing and caching the marginal likelihoods |G£) for all parent 
sets G p £ G P , all variables p £ V and all individuals j £ J; these could be obtained using essentially any 
suitable likelihood. The posterior marginal probability for an edge (i,p) belonging to the latent network G 
is computed as 

p(i£G p \y,G°) = UeG P p(G P \y P ,G° p ) ( 9 ) 

= E ^g p E p(G 1 p ,...,G J p: G p \y p ,G° p ) (10) 

G P £G P G p eg p :]£j 

« E E ^(y P \Gl,...,G J p ,G p ,G° p }p(G 1 p ,...,G^G p \G p ) (11) 

G P eg P G p eg p :j£j 

= E E p(G P \G p )]Jy(y p \G p )p(G P \G p ) (12) 

= E ^e Gp p(G P \G° p ) Yl U^(yp\ G M G i\ G p) ( 13 ) 

G P eg P G 3 p eg p :j£j o&J 

= £ l teGpP {G p \G a p ) JJ ]T ¥(y p \G p ) P (G p \G p ) (14) 

G P eg P G p eg P 

where Eqn. 14 uses the sum-product lemma (Kschischang et al., 2001) to interchange operators (see Appendix 
E). This final step has important consequences for algorithmic complexity (see Section 3.3). Note that, whilst 
this derivation can made without the explicit marginalization of Eqn. 10, the approach is quite general and 
may be used analogously to facilitate estimation of individual networks G J : 



p(i£Gl\y,G°) 



oc 



E heGiPHlVv^D (15) 
GpeS P 

E heGi E E p(Gl...,G J pl G p \y p ,Gl) (16) 

G p G5p G p es p c>eg p -.kej\{j} 

E ^eGp E E p{Vp\ G pi ■ ■ ■ i G pi G P> G p)p( G pi ■ ■ ■ i G pi G p\ G p) (17) 

GpGSp G p eg pGp ]eg P :kej\ti} 

E ^ect E E p(G p \G° p )l[y(y p \G p )p(G p \G p ) (18) 

GpGSp G p eg p G k p eg p -.kej\U} lej 

E ^Gp E p{ G p\ G imvi\ G i)p{ G i\ G p) E n ytfp\& P )p(& P m) 

G P cg p G p eg p G*eg p -.k&j\{j} ieJ\{j} 

E ^Gp E P( G p\ G lMvi\ G i)p( G i\ G p) II E Wv'MMCplCp) (20) 
GpeSp g p^p kej\{j}G p eg P 
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where again the sum-product lemma justifies the exchange of operators. 
3.3 Computational complexity 

Following Hill et al. (2012) we reduced the space of parent sets Q p using an in-degree restriction of the form 
\Gp\ < c for all Gl G G P , P G V , j G J . Thus the cardinality of the space of parent sets M = \Q p \ = 0{P C ) is 
polynomial in P, where it was previously exponential. This reduces summation over an exponential number 
of terms to a more manageable sum over polynomially many terms. Moreover, in the protein signaling 
example to follow, bounded in-degree is a reasonable biological assumption. Sensitivity to choice of c is 
discussed in Section 4.1. 

Caching of selected probabilities is used to avoid redundant recalculation. Pseudocode is provided in 
Appendix F, which consists of three phases of computation. Storage costs are dominated by Phases I and II, 
which each requiring the caching of O(PJM) real numbers. Phase II dominates computational effort, with 
total (serial) algorithmic complexity 0(PJ 2 M 2 ). However, within-phase computation is "embarrassingly 
parallel" in the sense that all calculations are independent (indicated by square parentheses notation in the 
pseudocode). Thus an ideal implementation requires 0(3) computational time. We provide a MATLAB 
implementation in Supplement B. 

4 Results 

We tested our joint estimation procedure on simulated time-course data. We compare our approach to the 
special cases of (i) inferring each network separately (INI); (ii) allowing parameters (but not networks) to 
change between individuals (ANI); (iii) the naive approach of aggregating all data (monolithic) and (iv) 
simple temporal correlations (absolute Pearson coefficient). For a fair comparison, all methods, with the 
exception of (iv) , were implemented so as to take account of the interventional nature of the data. We note 
that it is not possible to directly compare our results with Danahcr et al. (2012); Guo et al. (2011); Yang et 
al. (2012) since these methods do not apply to time-course data. The method of Penfold et al. (2012) applies 
to time-course data, but the computational demands of the approach precluded application in this setting. 
Specifically, in the simulated data example we report below, over 3000 rounds of inference were performed 
in total, on problems larger than DREAM4 (P = 10, J = 5). Using the approach of Penfold et al. (2012), 
these experiments would have required more than 10 years computational time; in contrast our approach 
required less than 24 hours serial computation on a standard laptop. 

4.1 Performance metrics 

The proposed methodology addresses three questions, some or all of which may be of scientific interest 
depending on application; (i) estimation of the latent network G, (ii) estimation of individual networks 
G 1 ,...,G J , and (iii) estimation of differences between individual networks. We quantify performance for 
tasks (i) and (ii) using the area under the receiver operating characteristic (ROC) curve (AUR). This metric, 
equivalent to the probability that a randomly chosen true edge is preferred by the inference scheme to a 
randomly chosen false edge, summarizes, across a range of thresholds, the ability to select edges in the 
data-generating network. AUR may be computed relative to the true latent network G, or relative to 
the true individual networks G- 7 , quantifying performance on tasks (i) and (ii) respectively. Both sets of 
results arc presented below, in the latter case averaging AUR over all individual networks. For (iii), in 
order to assess ability to estimate individual heterogeneity, we computed AUR scores based on the statistics 
Hp = IK* e G 3 p \y, G°) - p(i e G p \y, G°)| which should be close to one if i € G p AG p , otherwise F( p should 
be close to zero. 

It is easy to show that inference for the latent network, under only the prior, attains mean AUR equal to 
h n . Similarly, prior inference for the individual networks attains mean AUR equal to 1 — h n — h\ + 2h r) h\. 
This provides a baseline for the proposed methodology at tasks (i) and (ii) and allows performance to be 
decomposed into AUR due to prior knowledge and AUR contributed through inference. Using a systematic 
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variation of data-generating parameters, we defined 15 distinct data generating regimes. For all 15 regimes we 
considered 50 independent datasets; standard errors accompany average AUR scores. Results presented below 
use a computationally favorable in-degree restriction c = 3. Note that when the maximum in-degree of any 
of the true networks exceeds the computational restriction c, estimator consistency will not be guaranteed. 
In order to check robustness to c, a subset of experiments were repeated using c = 4, with close agreement 
observed (SFig. 4). 

4.2 Data generation 

A latent network G on P vertices was drawn from the Erdos distribution with edge density p/P. In order 
to simulate heterogeneity, the individual networks G J were obtained from G by maintaining the status 
(present/absent) of each edge independently with probability h\. A parameter (3j p for each parent i e G J p 
was independently drawn from the mixture normal distribution 0.5A/"(— 1, 0.1 2 ) + 0.5A/"(1, 0.1 2 ) (the mixture 
distribution ensures that parameters are not vanishingly small, so that the structural inference problem is 
well-defined) . Collecting together parameters produces matrices ft , corresponding to networks G J via i € G° p 

if and only if f3f p ^ 0. We also generate, for each individual j, intercept parameters a? ~ N(0p, Ipxp) 
representing baseline expression levels. Initial conditions were sampled as y-'(l) ~ N(0p, Ip x p)- Data were 
then generated from the autoregressive model y J (i) = a? + y 5 (t — I) ft +e t , where e t ~ N(0p,<r 2 Ip x p) are 
independent for t = 2, . . . , n. In this way E such time courses were obtained; that is, from E distinct initial 
conditions, so the total number of data for individual j is rij = En. In order to avoid issues of blow-up 
and to generate plausible datasets, the matrices ft were normalized by their spectral radii prior to data 
generation. 

In order to investigate the effect of using a prior network G°, we do not simply want to set G° equal to 
the latent network G, since in practice this network is unknown. We therefore generated a prior network G° 
by correctly specifying each potential edge as either present or absent with probability h v . In this way we 
mimic partial prior knowledge of the networks under study. 

4.3 Alternative data generating mechanisms 

We augmented the above data-generating scheme to mimic interventional experiments. In this case, for each 
time course, a randomly chosen variable is marked as the target of an interventional treatment. Data are 
then generated according to the augmented likelihood described in Appendix C (fixed effects were taken 
to be zero). Furthermore, in order to investigate the impact of model misspecification, we also considered 
time series data generated from a computational model of protein signaling, based on nonlinear ODEs (Xu 
et at, 2010). In order to extend this model, which is for a single cell type, to simulate a heterogeneous 
population, we randomly selected three protein species per individual and deleted their outgoing edges in 
the data-generating network (see Supplement A). 

4.4 Latent network 

Firstly we investigated ability to recover the latent network G. Initially all estimators are assigned approx- 
imately optimal hyperparameter values rj — 1,A — 4 (for Xu et al. (2010), A = 3) based on the heuristic of 
Eqn. 4; prior misspecification is investigated later in Section 4.7. We found little difference in the ability 
of JNI and ANI to recover the latent network structure across a wide range of regimes (STable 1). Since 
ANI enjoys favorable computational complexity, this estimator may be preferred for this task in practice. 
However, both approaches clearly outperformed monolithic inference, which was no better than inference 
based on the prior alone, demonstrating the importance of accounting for variation in parameter values. 
Correlations barely outperformed random sampling. 

In practice, one could also estimate G using independent network inference (INI), via the ad hoc estimator 
p(i G G p \y, G°) ~ j J2jejP^ e G p \y 3 , G°) which performs an unweighted average of J independent network 
inferences. However we found that INI offered no advantage over JNI and ANI, performing worse than both in 
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14 out of 15 regimes. We obtained qualitatively similar results for both alternative data-generating schemes 
(STables 3,6). 

4.5 Individual networks 

Secondly we investigated the ability to recover individual networks G J . At this task, JNI outperformed INI 
in all 15 regimes (Table 1). This demonstrates a substantial increase in statistical power resulting from the 
hierarchical Bayesian approach. JNI also outperformed monolithic estimation and inference using temporal 
correlations in all 15 regimes, with the latter demonstrating substantial bias. 

One may try to improve upon INI by firstly estimating the wild type network G, and then taking this 
estimate as a prior network G° within a second round of INI. Informed by Section 4.4, we consider the 
approach whereby G is first estimated using ANI, referring to this two-step procedure as "empirical network 
inference" (ENI). We found that the performance of ENI consistently matched that of JNI over a wide range 
of regimes. Since ENI avoids all joint computation, this may provide a practical estimator of individual 
networks in higher dimensional settings. Similar results were observed using the alternative data-generating 
schemes, although JNI slightly outperformed ENI on the Xu et al. (2010) datasets (STables 4,7). 

4.6 Feature detection 

Thirdly, we assessed ability to pinpoint sources of variation within the population. Interest is often directed 
toward individual-specific heterogeneity, or features. Informally, writing G- 7 = G + , features correspond 
to 8 3 . JNI regularizes between individuals; it therefore ought to eliminate spurious differences, leaving only 
features which are strongly supported by data. Equivalently, since JNI offers improved estimation of the 
latent network G, the features <P = G 3 — G ought also to be better estimated. 

Feature detection may also be performed using INI or ENI, comparing an latent network estimator (see 
ad hoc estimator in Section 4.4) with individual networks. The performance of JNI was compared to the 
performance of INI and ENI (STable 2). We found that, whilst feature detection is much more challenging 
that previous tasks, JNI mostly outperformed both INI and ENI, with exceptions occurring whenever the 
underlying dataset was highly informative (in which case INI was often superior). This suggests that co- 
herence of the JNI analysis aids in suppressing spurious features in the small sample setting. Alternative 
data-generating schemes produced qualitatively similar results, although JNI outperformed ENI on the Xu 
et al. (2010) datasets (STables 5,8). 

4.7 Robustness to hyperparameter misspecification 

For the above investigation we used Eqn. 4 to elicit hyperparameters A, r\. This was possible because the 
data-generating parameters h\, were known by design; however in general this will not be the case. It is 
therefore important that estimator performance does not deteriorate heavily when alternative hyperparam- 
eter values are employed. By fixing (h\,h v ) in the data generating process, we are able to investigate the 
robustness of JNI estimator to hyperparameter misspecification. In particular, when finite values are ascribed 
to data-generating parameters (h v , h\), ANI and INI may be interpreted as inference using misspecified prior 
distributions (see Section 2.3). 

SFig. 3 displays how performance of the JNI estimator for latent networks depends on the choice of 
hyperparameters A, rj g [0, oo). We notice that AUR remains close to that obtained for optimal A, 7/ over a 
fairly large interval, so that performance is not exquisitely dependent on prior elicitation. 
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4.8 Robustness to outliers and batch effects 

The biological datasets which motivate this study often contain outliers. At the same time, experimental 
design may lead to platform-specific batch effects. In order to probe estimator robustness, we generated 
data as previously described, with the addition of outliers and certain batch effects. Specifically, Gaussian 
noise from the contamination model 0.95A/"(0, 0.1 2 ) + 0.05A/"(0, 10 2 ) was added to all data prior to inference. 
At the same time, one individual's data were replaced entirely by Gaussian white noise to simulate a batch 
effect that could arise if preparation of a specific biological sample was incorrect. The relative decrease in 
performance at feature detection is reported in SFig. 5. We found that JNI remained the optimal estimator 
for all three estimation problems, in spite of these heavy violations to the modeling assumptions. However, 
the actual decrease in performance was more pronounced for JNI than for INI, suggesting that decoupled 
estimation (INI) may confer robustness to batch effects which affect single individuals. 

5 Discussion 

There are three distinct, though related, structure learning problems which may be addressed in the context 
of an heterogeneous population of individuals: 

1. Recovering a shared or "wild type" network from the heterogeneous data. 

2. Recovering networks for specific individuals. 

3. Pinpointing network variation within the population. 

Each problem may be of independent scientific interest, and the joint approaches investigated here address 
all three problems simultaneously within a coherent framework. We considered simulated data, with and 
without model misspecification. For all three problems we demonstrated that a joint analysis performs at 
least as well as independent or aggregate analyses. 

Our analysis, based on exact Bayesian model averaging, was massively faster then the sampling-based 
schemes of Penfold et al. (2012); Werhli and Husmeier (2008). Moreover, our estimators are deterministic, 
so that difficulties pertaining to MCMC convergence were avoided. Indeed, attaining convergence on joint 
models of this kind appears to be challenging (Werhli and Husmeier, 2008). The proposed methodology 
is scalable, with an embarrassingly parallel algorithm provided in Section 3.3. Furthermore, we described 
approximations to a joint analysis which enjoy further reduced computational complexity whilst providing 
almost equal estimator performance across a wide range of data-generating regimes. 

Whilst we considered the simplest form of regularization, based on prior modularity, there is potential 
to integrate richer knowledge into inference. One possibility would be hierarchical regularization that allows 
entire pathways to be either active or inactive. However, in this setting it would be important to revisit 
hyperparameter elicitation; the procedures which we have described are specific to SHD priors. In particular 
we restricted the joint model to have equal inverse temperatures A 1 = • • • = X J := A. Relaxing this assump- 
tion may improve robustness to batch effects which target single individuals, since then weak informativeness 
(A- 7 « 0) may be learned from data. It would also be interesting to distinguish between G \ G- 7 ("loss of 
function") and G J \G ("gain of function") features. In this work we did not explore information sharing 
through parameter values 9 J , yet this may yield more powerful estimators of network structure in settings 
where individuals' parameters 0^,6 are not independent. 

The JNI model could be formulated as a penalized (log-)likelihood 

log(p(y\G\ . . . , G J )) - Xd j ((P, G; A°) - V d(G, G°). (21) 

The frequentist approaches described by Danahcr et al. (2012); Guo et al. (2011); Yang et al. (2012) enjoy 
favorable computational complexity (esp. Danahcr et al. (2012) who provide an example with P = 22, 283 
variables and J = 187 individuals). However, in small to moderate dimensional settings, the Bayesian 
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methods proposed here are complementary in several respects: (i) Bayesian approaches provide a confidence 
measure for inferred topology, dealing with non-identifiable and multi-modal problems; (ii) no convexity 
assumptions are required on the form of the penalty functions d, d? in the Bayesian setting, which may assist 
with integration of ancillary information; (iii) the above penalized likelihood methods do not apply directly 
to time course data (but could be extended to do so). 

These experiments employed a promising formulation of likelihood under intervention due to Spencer and 
Mukherjee (2012). There are a number of interesting extensions which may be considered in future work: (i) 
In high dimensions, Bayesian variable selection requires multiplicity correction in order to avoid degeneracy 
(Scott and Bcrgcr, 2010). Such correction is required to control the false discovery rate and is independent 
to the penalty on model complexity provided by the marginal likelihood. In this moderate-dimensional 
work, in order to simplify the presentation, we did not employ a multiplicity correction; this should be an 
avenue for future development, (ii) Inference was based upon a local score borrowed from Bayesian linear 
regression. We chose to employ the g-prior due to Zellner (1986), where following George and Foster (2000) 
we used (conditional) empirical Bayes to select the g hyperparameter. Others have suggested setting g = n 
(unit information prior; Smith et al., 2001), whilst Dcltell et al. (2012) and Liang et al. (2008) propose prior 
distributions over g with attractive theoretical properties. Our empirical investigation suggested that the 
choice of hyperparameter elicitation is influential, but a thorough comparison of linear model specifications 
is beyond the scope of this paper, (iii) As discussed in Oates and Mukherjee (2012), linear autoregressive 
formulations may be inadequate in realistic settings; in particular, samples which are obtained unevenly 
in time can be problematic. Recent advances which incorporate mechanistic detail into the likelihood may 
prove advantageous (Oates et al., 2012). Since the JNI approach decouples the marginal likelihood and model 
averaging computations, it may be applied directly to the output of more sophisticated models, (iv) In the 
case of linear models, Barbieri and Berger (2004) showed that the median probability model (i.e. model 
averaging) provides superior predictive performance over the maximum a posteori (MAP) model. However 
we are unaware of an analogous result for causal inference in the Bayesian setting. 

Techniques for modeling heterogeneous data are clearly widely applicable. The methodology presented 
here may be applicable in other disciplines. For example, our approach is suited to meta-analyses of network 
analyses (Weilc et al., 2012), integration of multiple data sources (Kato et al., 2005; Wei and Pan, 2012; 
Wcrhli and Husmeier, 2008) or data arising from context dependent networks (Baumbach et al., 2009). The 
ideas discussed here share many connections with time-heterogeneous DBNs which, for brevity, we did not 
discuss in this paper (Dondelinger et al., 2010, 2012; Grzegorczyk and Husmeier, 2011; Song et at, 2009). 
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Appendix A: Dynamic Bayesian networks 

DBNs have emerged as popular tools for the analysis of multivariate time course data due to (i) the fact that 
no acyclicity assumption is required on the (static) network, and (ii) computational tractability resulting 
from a factorization of the likelihood function over variables p £ V (Hill et al., 2012). For the DBNs used 
here, an edge (p, q) from p £ V to q £ V in G 3 £ Q means that y 3 q {t), the observed value of variable q in 
individual j at time t, depends directly upon y 3 p [t — 1), the observed value of p in individual j at time t — 1 
(Fig. 3(a); note that t indexes sample index, rather than actual sampling time). Let y J denote a vector 
containing all observations for individual j. Then y 3 (t) is conditionally independent of {y 3 (t — t) : r > 2} 
given y 3 (t — 1) and G 3 (first-order Markov assumption). These conditional independence relations are 
conveniently summarized as a (static) network G 3 with exactly P vertices (Fig. 3(b)); note that this latter 
network need not be acyclic. 
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time t — 1 time t 




(a) G J ; DBN representation. (b) G J ; "static" representation. 



Figure 3: Dynamic Bayesian networks (DBNs). (a) An "unrolled" dynamic Bayesian network (DBN) showing 
each variable at successive time points, (b) The corresponding "static" representation of DBN (a) with 
exactly one vertex for each variable. 

Appendix B: Local likelihood for time course data 

We follow Aliferis et al. (2010); Hill et al. (2012); Pcnfold et al. (2012) in formulating inference in DBNs as 
a regression problem. We entertain models for the response y 3 p {t) as predicted by covariates yi{t — 1). In 
many cases multiple time series will be available. In this case the vector y p contains the concatenated time 
series. The DBN formulation gives rise to the following linear regression likelihood 

y{ = X a + X j G{ (5 + e (22) 

where e ~ iV(0„ x i, a 2 I nxn ). The matrix Xg = [l{ f= ij l{t>u]nx2 contains a term for the initial time 
point in each experiment. The elements of -X^,- corresponding to initial observations y p (l) are simply set 
to zero. Parameters 9 p = {a, (3, a} are specific to model G 3 p , variable p and cell line j. In the simplest case 
the model-specific component X J , of the design matrix consists of the raw predictors y L 3 (t — 1) where y\ 

denotes the elements of the vector yi(t — 1) belonging to the set A, though more complex basis functions 
may be used. 

Appendix C: Modeling interventions 

Following Eaton and Murphy (2007); Spencer and Mukherjee (2012) we model interventional data by modifi- 
cation to the DAG in line with a causal calculus (Pearl, 2009). We mention briefly some of the key ideas and 
refer the interested reader to the references for full details. A "perfect intervention" corresponds to 100% 
removal of the target's activity with 100% specificity. In the context of protein phosphorylation, kinases may 
be intervened upon using agents such as monoclonal antibodies, small molecule inhibitors or even si-RNA 
(Lu et al., 2011). We make the simplifying assumptions that these interventions are perfect and use the 
"perfect out fixed effects" (POFE) approach recommended by Spencer and Mukherjee (2012). We refer 
the reader to Spencer and Mukherjee (2012) for an extended discussion of POFE. This changes the DAG 
structure to model the intervention and also estimates a fixed effect parameter to model the change under 
intervention in the log-transformed data. 
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Appendix D: Exact marginal likelihood 

We employed a Jeffreys prior p(a, cr\G p , <j) J ) oc 1/a for a > over the common parameters. Prior to infer- 
ence, the non-interventional components of the design matrix were orthogonalized using the transformation 
(XL)tt h> J2i(In ~ P Q )u{X j nj ) lk , where P = X (X£ Xo)- 1 X$ (DelteU et at, 2012). We then assumed 

Up Up 

a g-prior for regression coefficients (Zellner, 1986), given by /3|a, a, G p , <]>> ~ AT(0(, x i, $ a 2 {X^ X Q j ) _1 ) 

where 6 = dim(/3). Using these priors for the DBNs with intervention as outlined above, the marginal 
likelihood can be obtained in closed-form: 

WM> ^ « (^IF 5 if ( w " Po " ^tt Pg y 2 (23) 

where i-^j = -X^ (X^Xq, )~ 1 X^ j , a — dim(a) and b = dim(/3). Empirical investigations have previously 

demonstrated good results for network inference based on the above marginal likelihood (Hill et at, 2012; 
Spencer and Mukherjee, 2012). Following George and Foster (2000) we used the (conditional) empirical 
Bayes approach to determine the J hyperparameter (details in Supplement A). 

Appendix E: The sum-product lemma 

The "sum-product" lemma, which forms the basis for several exact inference procedures in graphical models, 
can be expressed in its most basic form as follows: For a finite set of functionals fi : Xi — > R on finite 
domains <pi indexed by 1 < i < I we have 

e n^)=nE/>(4 ( 24 ) 

The proof is straight forward (induction on I) and can be found in e.g. Kschischang et al. (2001). The 
sum-product lemma is typically used to reduce algorithmic complexity, replacing the OQXi \ X ■ ■ ■ x x I) 
expression on the left hand side by the C ( | A"i | + • • ■ + \Xj\) expression on the right hand side. 

Appendix F: Joint network inference - pseudocode 

This appendix contains pseudocode for exact joint model averaging. [Computational complexity of calcu- 
lating marginal likelihoods ^(yp\G p ) will scale with sample size n; scaling exponents shown here assume 
0(n) = 0(1).] Below we provide pseudocode for computation of posterior marginal inclusion probabilities 
for edges in individual networks G J : 

for all p G V do 

Phase I: 

Compute and cache [Vp G V] [Vj G J] [VG p G Q p ] 
W P \G P ) = E Gi eg p W P \G p ) P (G p \G p ) p{M)] 
Phase II: 

Compute and cache [Vp G V] [Vj G J] [VG£ G Q p ] 

P (G p \y p ,G° p ) cx E Gpe ^P(G P |G0)q3(^|G^p(G^|G p )n fee ^ {j} ^(y||G p ) [O(MJ)} 
Phase III: 

Compute and cache [Vp G V] [Vj G J] [Vz G V\ 

p(i G G£|y,G°) = E G j eSp h eG ,p(G p \y,G° p ) [0(M)\ 

end for 

Pseudocode for inference of the latent network G proceeds analogously. 
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